Introduction

PCA is used to plot our dataset to understand new variables that explain most of its variability by using a linear combination of our variables based on covariance (or correlation matrix). In order to compare variables between them, variables must be at the same scale. One of the most common data transformation before PCA is Z-score normalization. It consists in considering a variable as normal distribution whose mean \(\mu\) will be equal to 0 and the standard deviation \(\sigma\) is equal. While this normalization allows the comparison between variables whose scales and measure units are different, it can induce skewness if the variables are at the same scale and have been measured from the same instrument. Therefore, the scaling to \(\sigma = 1\) might look arbitrary considering the instruments precision, depth of signal, etc… A common other transformation is the logarithm. Its properties allow a “stretching” of variables whose mean and standard deviation are very low and “flatten” extreme values. It induces “symmetry” in our data (Gaussian-like distribution).

Such transformations might be tempting because normal distribution is very easy to interpret and work with, but there is a special case in which statistics might be skewed: compositional data. Compositional data is defined also as closed data. The variables are expressed as fractions, proportions and are under the following constraint: the total sum is always equal to a constant number (1 for proportion, 100 for percentages). Karl Pearson (1897) warned about the spurious correlation that can result from compositional data as the variables are under the constant sum constrain. Moreover, if we consider a vector of proportion called x, \(x = (x1, ....,xd)\) of dimension \(d\), we can express it in \(d-1\) dimension because we need to have at least \(d-1\) elements, the last one can be inferred. Those spurious correlation can skew the whole process of PCA as it is based on covariance matrix. Under such constraints, the significance of correlation/covariance values may be misleading. Aitchison (Biometrika,1982,1983,1986) worked on compositional data and introduced several concepts, including the Centered-Log Ratio (CLR) transformation in order to break the constraints and do a correct PCA.

In many fields such as geology, ecology, compositional data is usually the only kind of data scientists can work with. Thus they are aware of those statistical issues and the transformations their data should go through. Nonetheless, it remains ignored in many fields of biology. It is thus necessary to bridge some fields, because they might have common attributes, methods to share.

Our research focuses on the cellular landscape of tumor micro-environment and in order to find recurring immune patterns. We behave like ecologists counting species of cells, covering a large area and therefore, we might gain accuracy using their same methods for analyzing multi-dimensional data. In our case, the dataset is cell percentages identified by CyTOF from 143 breast cancer tumors .

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
## 
## Attaching package: 'igraph'
## The following object is masked from 'package:plotly':
## 
##     groups
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
## 
##     compose, simplify
## The following object is masked from 'package:tidyr':
## 
##     crossing
## The following object is masked from 'package:tibble':
## 
##     as_data_frame
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
## Loading required package: viridisLite

First, we remove the older objects and set the new working directory

setwd("/scratch/anissa.el/ImmuneStates/wagner_analysis")
source ("./scripts/R/Wagner_Keren_functions.r") # Some useful functions for Keren and Wagner Data processing

Pre-processing of the data

MetaData <- read_excel("./data/mmc2.xlsx",col_names=TRUE,skip=1)%>%dplyr::rename(Patient=`Patient ID`)
MetaData<- MetaData%>%mutate(Status = ifelse(`ER Status by IHC`=='P' & `PR Status by IHC`=='P'& `HER2 Status by IHC`=='N',"ER+PR+",
                                             ifelse(`ER Status by IHC`=='P' & `PR Status by IHC`=='N'& `HER2 Status by IHC`=='N',"ER+",
                                                    ifelse(`ER Status by IHC`=='N' & `PR Status by IHC`=='P'& `HER2 Status by IHC`=='N',"PR+",
                                                           ifelse(`ER Status by IHC`=='N' & `PR Status by IHC`=='N'& `HER2 Status by IHC`=='P',"HER2+",
                                                                ifelse(`ER Status by IHC`=='N' & `PR Status by IHC`=='N'& `HER2 Status by IHC`=='N',"TN",
                                                                       ifelse(`ER Status by IHC`=='N' & `PR Status by IHC`=='P'& `HER2 Status by IHC`=='P',"PR+HER2+",
                                                                              ifelse(`ER Status by IHC`=='P' & `PR Status by IHC`=='P'& `HER2 Status by IHC`=='P',"HER2+","Unknown")))  )))))

#Check the tumors for which patients received a Neoadjuvant treatment ==> this may interfere with the proportions of T cells in the sample:
NeoAdjuvant <- MetaData%>%filter(`Neoadjuvant Therapy Received` =="yes")

This is the composition of BC samples : 54 luminal A, 71 luminal B, six luminal B-HER2+, one HER2+, and six TN tumors (144 in total). Let’s group and summarize the found percentages:

# FIrst, we remove the old objects stored in the R environment

CellsProp <-read_excel("./data/mmc5.xlsx",col_names=TRUE,skip=2)

#colnames(CellsProp)
CellsPropBC <- CellsProp%>%filter(`LiveCells [%]`>50)#CellsProp%>%filter(Tissue=="Tumor")%>%filter(`LiveCells [%]`>50)

#CellsPropBC <- left_join(CellsPropBC,MetaData,by="Patient")%>%filter(`Neoadjuvant Therapy Received` !="yes")
CellsPropBC <- CellsPropBC%>%
  mutate(LiveCells_tot= `EndothelialCells [%]`+`EpithelialCells [%]`+`Fibroblasts [%]`+`ImmuneCells [%]`+`Other [%]`)

CellsPropBC <- CellsPropBC%>%
  mutate(EndothelialCells=((`EndothelialCells [%]`/LiveCells_tot))*100)%>%
  mutate(EpithelialCells=((`EpithelialCells [%]`/LiveCells_tot))*100)%>%
  mutate(Fibroblasts=((`Fibroblasts [%]`/LiveCells_tot))*100)%>%
  mutate(ImmuneCells=((`ImmuneCells [%]`/LiveCells_tot))*100)%>%
  mutate(Other=((`Other [%]`/LiveCells_tot))*100)

#Correction to ImmuneCells percentage
CellsPropBC <- CellsPropBC%>%
  mutate(TCells= (`TCells [%]`/100)*ImmuneCells)%>%
  mutate(NK= (`NaturalKillerCells [%]`/100)*ImmuneCells)%>%
  mutate(Granulocytes = (`Granulocytes [%]`/100)*ImmuneCells)%>%
  mutate(BCells= (`BCells [%]`/100)*ImmuneCells)%>%
  mutate(PlasmaCells= (`PlasmaCells [%]`/100)*ImmuneCells)%>%
  mutate(DC= (`plasmacytoidDendriticCells [%]`/100)*ImmuneCells)%>%
  mutate(MyeloidCells= (`MyeloidCells [%]`/100)*ImmuneCells)%>%
  mutate(Basophils= (`Basophils [%]`/100)*ImmuneCells)
## Let's check for the proportion of T cells across the tumor samples
#ggplot(data= CellsPropBC)+
#  geom_boxplot(mapping=aes(y=TCells))+
# labs(title="Boxplot of proportions of T cells across 144 tumor samples (in %)")+
#  ylab("percentage")
## Correcyion of T cells percentage
CellsPropBC <- CellsPropBC%>%
  mutate(sumTcells = `TCells CD4+ [%]`+`TCells CD8+ [%]`)%>%
  mutate(OtherTcells = ((100 - sumTcells)/100)*TCells)%>%
  mutate(`CD4-T` = (`TCells CD4+ [%]`/100)*TCells)%>%
  mutate(`CD8-T` = (`TCells CD8+ [%]`/100)*TCells)

## Correction to myeloid cells percentage
CellsPropBC <- CellsPropBC %>% mutate(Monocytes = (M06 + M15)*MyeloidCells)%>%
  mutate (Macrophages = (M01 + M02 + M03 + M04 + M08 + M09 + M11 + M13 + M14 + M16 + M17)*MyeloidCells)%>%
  mutate(OtherMyeloid = (M05 + M07 + M10 + M12 + M18 + M19)*MyeloidCells)%>%
  mutate(OtherImmune = OtherMyeloid+ OtherTcells+Basophils+Granulocytes+Monocytes)
  #mutate(CD4TCells = (T02+T03+T04+T08+T09+T13+T17+T18+T20)*50)%>%
  #mutate(Tregs = T01*50)%>%
  #mutate(CD8Tcells = (T05 + T06 + T07 + T10 + T11 + T12 + T14 + T15 + T16 + T19)*50)
CellsPropBC2 <- CellsPropBC%>%filter(Tissue=="Tumor")%>%dplyr::select(`Sample ID`,EndothelialCells,EpithelialCells,Fibroblasts,Other,NK,BCells,PlasmaCells,DC,`CD4-T`,`CD8-T`,Macrophages,OtherImmune)%>%
  column_to_rownames(var="Sample ID")

#Select metadata from BC patients
MetaData<- MetaData%>%mutate(Patient=rownames(CellsPropBC2))
StatusTumor <- MetaData%>%dplyr::select(Patient,Status)
AgeResection <-MetaData%>%dplyr::select(Patient,`Age at Surgery`)

AllCellsProps <-  CellsPropBC%>%dplyr::select(`Sample ID`,EndothelialCells,EpithelialCells,Fibroblasts,Other,NK,BCells,PlasmaCells,DC,`CD4-T`,`CD8-T`,Macrophages,OtherImmune)%>%
  column_to_rownames(var="Sample ID")

Data transformations

In this section, we are going to explore some well-known data transformations traditionally used before PCA. Most of them aim to simplify the data as a normal distribution (0,1): from Z-score to log transformation. We’ll see how it impacts PCA (correlations and dimension reduction). Finally, we’ll introduce the CLR transformation suggested by J. Aitchison (Biometrika,1982)

Raw original data

Cells have differential abundance and the mean/sd of each cell type might be very different. THis is due to biology: obviously, cancer cells might be the most abundant ones whereas for certain immune cells, it is well-known that they are not so abundant because only few of them are needed to trigger bigger immune responses.

CellsPropBC2.types <- as_tibble(CellsPropBC2,rownames=NA)%>%
  pivot_longer(cols=colnames(CellsPropBC2),names_to="cell_type",
               values_to="proportion")
ggplot(CellsPropBC2.types,aes(x=cell_type,y=proportion,fill=cell_type))+
  geom_boxplot()+
  theme(axis.text.x=element_text(angle=45,hjust=0.8,vjust=0.5),
        legend.title = element_blank())+
  xlab(" ") + ylab("raw proportion")

heatmap(cor(CellsPropBC2),Rowv = NA,Colv = NA,scale="none")

Mean scaling

#CellsPropBC2<- CellsPropBC2%>%filter(Other<=50)
#(2/pi *sqrt(asin(CellsPropBC2/100)))
#library(Rtsne)

cells.means <- apply(CellsPropBC2,MARGIN=2, function(x) mean(x))

CellsPropBC2.types2 <- as_tibble(CellsPropBC2/cells.means,rownames=NA)%>%
  pivot_longer(cols=colnames(CellsPropBC2),names_to="cell_type",
               values_to="proportion")
ggplot(CellsPropBC2.types2,aes(x = cell_type,y = proportion,fill = cell_type))+
  geom_boxplot()+
  theme(axis.text.x=element_text(angle=45,hjust=0.8,vjust=0.5),
        legend.title = element_blank())+
  xlab(" ") + ylab("proportion per mean of cell type")

scBCPCA.scaleMean <- dudi.pca(CellsPropBC2/cells.means,scale = FALSE,
                              center = TRUE,scannf = FALSE,nf = 3)
fviz_pca_biplot(scBCPCA.scaleMean,invisible = "ind",repel = TRUE)
## Warning: ggrepel: 6 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

fviz_pca_biplot(scBCPCA.scaleMean,axes=c(1,3),invisible = "ind",repel=TRUE)
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

fviz_eig(scBCPCA.scaleMean)

fig11 <- plot_ly(x = scBCPCA.scaleMean$li$Axis1, 
                y = scBCPCA.scaleMean$li$Axis2,
                z = scBCPCA.scaleMean$li$Axis3,
                color =~CellsPropBC2$`Other`,
                type = "scatter3d", mode = "markers",
                marker = list(symbol = "triangle",size = 4),
                mode = "text")

fig11 <- fig11 %>% layout(scene = list(xaxis = list(title = "PC1"), 
                                       yaxis = list(title = "PC2"), 
                                       zaxis = list(title = "PC3") ),
                        title = "PCA on cell proportions space of BC tumors")
fig11

log2 transformation

#CellsPropBC2<- CellsPropBC2%>%filter(Other<=50)
CellsPropBC2log.types <- as_tibble(log2(CellsPropBC2+1),rownames=NA)%>%pivot_longer(cols=colnames(CellsPropBC2),
                                                  names_to="cell_type",
                                                  values_to="proportion")
ggplot(CellsPropBC2log.types,aes(x=cell_type,y=proportion,fill=cell_type))+
  geom_boxplot()+
  theme(axis.text.x=element_text(angle=45,hjust=0.8,vjust=0.5),
        legend.title=element_blank())+
  xlab("")+ylab("log(proportion)")

#(2/pi *sqrt(asin(CellsPropBC2/100)))
#library(Rtsne)
cells.means <- apply(CellsPropBC2,MARGIN=2, function(x) mean(x))

scBCPCA.log <- dudi.pca(log2(CellsPropBC2+1),scale=FALSE,center=TRUE,scannf=FALSE,nf=3)
fviz_pca_biplot(scBCPCA.log,label ="var",col.var="black",col.ind = "indianred2",repel=TRUE)

fviz_pca_biplot(scBCPCA.log,axes=c(1,3),label ="var",col.var="black",col.ind = "indianred2",repel=TRUE)

fviz_eig(scBCPCA.log)

fig12 <- plot_ly(x = scBCPCA.log$li$Axis1, 
                y = scBCPCA.log$li$Axis2,
                z = scBCPCA.log$li$Axis3,
                color =~CellsPropBC2$`Other`,
                type = "scatter3d", mode = "markers",
                marker = list(symbol = "triangle",size = 8),
                mode = "text")

fig12 <- fig12 %>% layout(scene = list(xaxis = list(title = "PC1"), yaxis = list(title = "PC2"), zaxis = list(title = "PC3") ),
                        title = "PCA on cell proportions space of BC tumors")
fig12

Shuffling the cell proportions

# de-correlate the cells using sample() as a row-wise operation (for each tumor)
# For each randomization, we perform an unscaled, centered PCA on the data
# We then collect the list of eigen values for each principal component
# A cumulative sum of variance explained is then computed (in %)
# The results are plotted in a figure with the mean of those shuffled +/- 2 sd
nbShuffles <- 100
nPC <- length(colnames(CellsPropBC2))
EigenVals.rand <- matrix(nrow = nbShuffles,ncol = nPC)
set.seed(15) # for reproducibility
for (i in 1:nbShuffles){
  CellsPropBC.rand <- apply(CellsPropBC2,MARGIN = 2,function(x) sample(x,size = length(x),replace = FALSE)) ## Column-wise operation ==> MARGIN=2
  sumsSamples <- apply(CellsPropBC.rand,MARGIN=1,function(x) sum(x))
  CellsPropBC.rand2 <- (CellsPropBC.rand/sumsSamples)*100
  CellsPropBC.rand.log2 <- log2(CellsPropBC.rand2+1)
  eigs <- dudi.pca(CellsPropBC.rand.log2,scale = FALSE,center = TRUE,nf = 3, scannf = FALSE)$eig
  #print(length(eigs))
  EigenVals.rand[i,]<- cumsum(eigs/sum(eigs)*100)
}

# Original data: PropsBC.CLR2

#### Get mean and +/- sd of the eigen values from shuffles (100 shuffles)
scBC.rand.CumEigs.mean <- apply(EigenVals.rand,MARGIN = 2,function(x) mean(x))
scBC.rand.CumEigs.sdUp <- apply(EigenVals.rand,MARGIN = 2,function(x) mean(x) + 2*sd(x))
scBC.rand.CumEigs.sdLow <- apply(EigenVals.rand,MARGIN = 2,function(x) mean(x) - 2*sd(x))
scBC.CumEigs <- cumsum(scBCPCA.log$eig/sum(scBCPCA.log$eig)*100) # Eigen vlaues of PCA from real tumors (CellsPropBC2 abundance matrix as input)

#Correlation map after randomization
heatmap(cor(CellsPropBC.rand.log2,method = "pearson"),Colv = NA,Rowv = NA,col = viridis(12))

nbComp <- seq(1,nPC,1)
CumEigs.rand <- as_tibble(cbind(scBC.rand.CumEigs.sdUp,scBC.rand.CumEigs.sdLow))%>%
  mutate(nbComp = nbComp)
CumEigs <- as_tibble(cbind(scBC.CumEigs,scBC.rand.CumEigs.mean))%>%
  mutate(nbComp = nbComp)

CumEigs<-CumEigs%>%
  dplyr::rename(`real \ntumors` = scBC.CumEigs)%>%
  dplyr::rename(`shuffled \ntumors` = scBC.rand.CumEigs.mean)%>%
  pivot_longer(cols = c(`real \ntumors`, `shuffled \ntumors`),
               names_to = "groups",
               values_to = "value")

ggplot()+
   geom_line(data = CumEigs,
             aes(x = nbComp,y = value,
                 group = groups,color = groups,
                 linetype = groups))+
   scale_linetype_manual(values = c("solid", "twodash"))+
   scale_color_manual(values = c("black","red"))+
   scale_size_manual(values = c(2, 2))+
   geom_vline(xintercept = 3, linetype = "dotted", 
                 color = "black", size = 0.8)+
  labs(x="Number of Principal Components (PCs)",
       y="% of variance in cellular \ncomposition of tumors")+
  theme(legend.title = element_blank())+
  geom_ribbon(data = CumEigs.rand, aes(x = nbComp,
                                       ymin = scBC.rand.CumEigs.sdLow,
                                       ymax = scBC.rand.CumEigs.sdUp),
              fill = "indianred2",alpha = 0.3)

  #ggsave("./figs/EigsRandWagner.pdf", height = 3, width = 4)

CLR tranformation: prior step to PCA for compositional data

#install.packages("zCompositions")
#install.packages("compositions")
library(zCompositions)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
## 
##     select
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: NADA
## Loading required package: survival
## 
## Attaching package: 'NADA'
## The following object is masked from 'package:stats':
## 
##     cor
## Loading required package: truncnorm
library(compositions)
## Welcome to compositions, a package for compositional data analysis.
## Find an intro with "? compositions"
## 
## Attaching package: 'compositions'
## The following object is masked from 'package:NADA':
## 
##     cor
## The following object is masked from 'package:igraph':
## 
##     normalize
## The following objects are masked from 'package:stats':
## 
##     anova, cor, cov, dist, var
## The following objects are masked from 'package:base':
## 
##     %*%, norm, scale, scale.default
#multLN(CellsPropBC2)

CellsPropBC.corrected <- CellsPropBC2 +1
geom.means <- apply(CellsPropBC.corrected,MARGIN=1,function(x) exp(mean(log(x))))

PropsBC.CLR2 <- clr(CellsPropBC.corrected)#log(CellsPropBC.corrected/geom.means)#t(apply(CellsPropBC.corrected,MARGIN=1,function(x) log(x/geom.means)))
#colnames(PropsBC.CLR) <- colnames(CellsPropBC2)
PropsBC.CLR <- log(CellsPropBC.corrected/geom.means)

plot(x = PropsBC.CLR2$Fibroblasts,
     y = PropsBC.CLR2$EndothelialCells,
     xlab = "Fibroblast (clr(%))",
     ylab = "Endothelial (clr(%))")

plot(x = PropsBC.CLR2$BCells,
     y = PropsBC.CLR2$`CD4-T`,
     xlab="B (clr(%))",
     ylab="CD4 T clr((%))")

PropsBC.CLR.types <- as_tibble(PropsBC.CLR2,rownames = NA)%>%
                                pivot_longer(cols = colnames(PropsBC.CLR2),
                                             names_to = "cell_type",
                                             values_to = "proportion")

ggplot(PropsBC.CLR.types,aes(x = cell_type,
                             y = proportion,
                             fill = cell_type))+
  geom_boxplot()+
  theme(axis.text.x = element_text(angle = 45,hjust = 0.8,vjust = 0.5),
        legend.title = element_blank())+
  xlab("")+ylab("CLR (Centered Log ratio)")

#PropsBC.CLR2.types <- as_tibble(PropsBC.CLR2,rownames=NA)%>%pivot_longer(cols=colnames(PropsBC.CLR2),
#                                                  names_to="cell_type",
#                                                  values_to="proportion")
#ggplot(PropsBC.CLR2.types,aes(x=cell_type,y=proportion,fill=cell_type))+
#  geom_boxplot()

#heatmap(cov(PropsBC.CLR),colV=NA,Rowv = NA)

#rowSums(PropsBC.CLR)
scBCPCA.CLR <- dudi.pca(PropsBC.CLR2,scale=FALSE,center=TRUE,scannf=FALSE,nf=3)
write.csv(as.matrix(scBCPCA.CLR$li), "/scratch/anissa.el/ImmuneStates/wagner_analysis/data/3PCA_Wagner_CLR.csv",row.names=FALSE)


fviz_pca_biplot(scBCPCA.CLR,label ="var",col.var="black",col.ind = "indianred2",repel=TRUE)

fviz_pca_biplot(scBCPCA.CLR,axes=c(1,3),col.var="black",label="var",col.ind = "indianred2",repel=TRUE)

fviz_pca_biplot(scBCPCA.CLR,axes=c(2,3),col.var="black",label="var",col.ind = "indianred2",repel=TRUE)

fviz_eig(scBCPCA.CLR)

fig12 <- plot_ly(x = scBCPCA.CLR$li$Axis1, 
                y = scBCPCA.CLR$li$Axis2,
                z = scBCPCA.CLR$li$Axis3,
                color =~CellsPropBC2$`Other`,
                type = "scatter3d", mode = "markers",
                marker = list(symbol = "triangle",size = 8),
                mode = "text")
              

fig12 <- fig12 %>% layout(scene = list(xaxis = list(title = "PC1"), yaxis = list(title = "PC2"), zaxis = list(title = "PC3") ),
                        title = "PCA on cell proportions space of BC tumors")
fig12
heatmap(cor(PropsBC.CLR),Rowv = NA,Colv = NA,scale="none")

varExp3PC.clr <- sum( scBCPCA.CLR$eig[1:3]/sum( scBCPCA.CLR$eig))*100

The CLR transformation allows us to do a proper dimension reduction as the first 3 Principal Components explain 73.5% of the total variance.

The tumors with a high number of healthy tissue are clearly separated from the rest of the cloud of points which shows that the CLR transformation was able to point out some aspects of our dataset that were not clearly distinguished when running the analysis on raw proportions.

It is thus possible to proceed to clustering in order to do a primary binary classification: samples that are on the edge of the TME and those that are part of the TME. Then,we can do an archetypal analysis on the tumors that are part of the TME to look for a structure in cellular abundance.

CLustering TME and non-TME samples: K means clustering

set.seed(250)
km.res <- kmeans(as.matrix(scBCPCA.CLR$li), 2, nstart = 4)

fig3 <- plot_ly(x = scBCPCA.CLR$li$Axis1, 
                y = scBCPCA.CLR$li$Axis2,
                z = scBCPCA.CLR$li$Axis3,
                color =~as.factor(km.res$cluster),
                type = "scatter3d", mode = "markers",
                marker = list(symbol = "triangle",size = 8),
                mode = "text")
              

fig3 <- fig3 %>% layout(scene = list(xaxis = list(title = "PC1"), yaxis = list(title = "PC2"), zaxis = list(title = "PC3") ),
                        title = "PCA on cell proportions space of BC tumors")
fig3
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
scBC.TME.PCA <-scBCPCA.CLR[km.res$cluster ==2,] 
write.csv(as.matrix(scBC.TME.PCA$li), "/srv/mfs/hausserlab/anissa.el/ImmuneStates/wagner_analysis/data/3PCA_Wagner_CLR_clust1.csv",row.names=FALSE)

Now, let’s apply an archetypal analysis on the latter PCA. The aim remains the same: finding archetypes that shape the continuum of immunes states in cellular abundance of BC tumors.

import sys
sys.path.insert(1, '/srv/mfs/hausserlab/fabio/data_analysis/src/utils')
from archetypes import ArchetypalAnalysis
import pandas as pd
import numpy as np

# Reading the file of PCA from Wagner dataset
pcData = pd.read_csv("/scratch/anissa.el/ImmuneStates/wagner_analysis/data/3PCA_Wagner_CLR_clust1.csv")
pc3dWagner = np.array(pcData, dtype = "float64")
AA = ArchetypalAnalysis(n_archetypes = 3, 
                          tolerance = 0.0001, 
                          max_iter = 500, 
                          random_state = 1, 
                          C = 0.01, 
                          initialize = 'random',
                          redundancy_try = 30)
AA.fit_transform(pc3dWagner)
## array([[-3.70836735e-01,  8.25623838e-01, -2.27856667e-01],
##        [-6.37514205e-01,  1.22979027e+00,  8.93947178e-01],
##        [ 1.29209505e+00,  1.04823838e+00,  8.03450028e-01],
##        [ 1.03659576e+00,  6.36503326e-01, -4.59905110e-01],
##        [ 1.68020357e+00,  1.01169248e+00,  7.85226049e-01],
##        [ 9.75039166e-01,  9.34119520e-01,  3.97155672e-01],
##        [-1.02009923e-01,  1.17936321e+00,  8.68800792e-01],
##        [ 1.33501160e+00,  9.83412337e-01,  6.23508861e-01],
##        [-1.28764102e-01,  3.07293487e-01, -1.68973912e+00],
##        [-2.09773650e+00,  1.10940621e+00,  2.08327623e-01],
##        [ 5.43344337e-01,  5.80660311e-01, -7.36051030e-01],
##        [-2.54952492e-01,  1.19379742e+00,  8.76006351e-01],
##        [-7.96925173e-01,  8.45389103e-01, -2.67157673e-01],
##        [-1.01475951e+00,  1.20800170e+00,  7.44134366e-01],
##        [-1.09792020e+00,  1.27299973e+00,  9.15459388e-01],
##        [-3.06067171e+00,  1.45761405e+00,  1.00746963e+00],
##        [ 2.11517598e+00,  9.70757650e-01,  7.64819231e-01],
##        [-1.10914281e+00,  1.19546093e+00,  6.85781644e-01],
##        [-8.91017934e-01,  4.23817180e-01, -1.52300791e+00],
##        [ 2.13878127e+00,  9.68591849e-01,  7.63752904e-01],
##        [ 1.64999015e+00,  8.09766219e-01,  1.87510381e-01],
##        [-2.71937035e-01,  9.34891289e-01,  1.14623993e-01],
##        [ 3.30621820e-01,  8.47298964e-01, -4.49926261e-03],
##        [-3.24333069e+00,  1.47305260e+00,  1.01078247e+00],
##        [ 4.09956523e-01,  9.28857347e-01,  2.52725993e-01],
##        [ 1.03357779e+00,  6.35664620e-01, -4.63152731e-01],
##        [ 6.97177177e-01,  9.34325307e-01,  3.34170418e-01],
##        [ 8.68833709e-01,  9.54700314e-01,  4.33165927e-01],
##        [-1.82477583e+00,  8.40316286e-01, -5.17383801e-01],
##        [-9.70858607e-01,  1.03317147e+00,  2.42542578e-01],
##        [-7.32213373e-01,  7.09388380e-01, -6.50716217e-01],
##        [-2.41985137e+00,  1.14258970e+00,  2.31715809e-01],
##        [-2.58794047e+00,  1.41332418e+00,  9.85437909e-01],
##        [ 8.08082415e-01,  5.71961929e-01, -7.01118845e-01],
##        [-1.01554015e+00,  1.09138680e+00,  4.02643373e-01],
##        [-6.95122409e-02,  7.32863863e-01, -4.30577326e-01],
##        [ 2.23990129e-01,  8.69629940e-01,  3.66897705e-02],
##        [ 1.77076858e-01,  8.24539061e-01, -1.06082270e-01],
##        [-1.12809357e+00,  1.22221346e+00,  7.59986391e-01],
##        [-1.01681902e+00,  5.57817745e-01, -1.15933969e+00],
##        [-2.99177130e+00,  1.36084468e+00,  7.39939572e-01],
##        [-3.56368857e-02,  9.19071105e-01,  1.21932776e-01],
##        [ 1.60295936e+00,  9.59427457e-01,  6.14586517e-01],
##        [-6.91192275e-01,  7.75193861e-01, -4.48553029e-01],
##        [ 1.11349940e+00,  6.58238758e-01, -3.78793931e-01],
##        [-1.65621791e+00,  1.32561964e+00,  9.41710426e-01],
##        [ 1.03441602e+00,  6.35921653e-01, -4.62359620e-01],
##        [-2.82671088e-01,  8.01118263e-01, -2.79498362e-01],
##        [ 8.71740939e-01,  1.00123983e+00,  5.70029006e-01],
##        [ 1.87647093e+00,  9.93198102e-01,  7.76000510e-01],
##        [-3.50824978e-01,  7.47103831e-01, -4.53239526e-01],
##        [-8.84627457e-01,  4.20965800e-01, -1.52980247e+00],
##        [ 7.64119641e-01,  5.59581495e-01, -7.47689703e-01],
##        [-1.62594609e+00,  1.15085306e+00,  4.37223929e-01],
##        [ 1.77930369e+00,  8.46348088e-01,  3.23746446e-01],
##        [ 1.00713632e+00,  6.28185490e-01, -4.91014070e-01],
##        [-6.89735537e-01,  3.53114114e-01, -1.68366850e+00],
##        [ 6.89699980e-01,  6.25930334e-01, -5.69994751e-01],
##        [ 9.26231731e-01,  6.05334689e-01, -5.76414845e-01],
##        [ 2.19371711e+00,  9.63402095e-01,  7.61160957e-01],
##        [ 6.61932614e-01,  5.30658235e-01, -8.55277972e-01],
##        [ 1.75932650e+00,  8.40707548e-01,  3.02650705e-01],
##        [ 4.90740870e-01,  4.82312864e-01, -1.03601180e+00],
##        [-2.81782858e-01,  8.16150025e-01, -2.35375847e-01],
##        [ 1.37382790e+00,  7.31776353e-01, -1.04043456e-01],
##        [ 8.23284764e-01,  6.76102778e-01, -3.92690324e-01],
##        [ 2.15833296e+00,  9.66809416e-01,  7.62878146e-01],
##        [-2.24748306e+00,  1.17817050e+00,  3.75267562e-01],
##        [-2.41919186e+00,  1.10544969e+00,  1.23116886e-01],
##        [-2.02384250e+00,  1.19765516e+00,  4.83183955e-01],
##        [ 1.00589239e+00,  8.78572058e-01,  2.41584662e-01],
##        [-7.17123143e-01,  6.69829977e-01, -7.62905318e-01],
##        [ 1.87056095e+00,  8.72117614e-01,  4.20099186e-01],
##        [-7.26128741e-01,  1.01856799e+00,  2.55637187e-01],
##        [ 1.47163724e+00,  7.59421760e-01, -8.88924822e-04],
##        [-1.20875772e+00,  9.36508533e-01, -9.49190648e-02],
##        [-9.72255390e-01,  5.73460739e-01, -1.10322882e+00],
##        [-1.17985605e+00,  5.52650355e-01, -1.21175569e+00],
##        [ 1.09947288e+00,  7.34721548e-01, -1.58051140e-01],
##        [ 7.47297055e-01,  5.54786484e-01, -7.65249978e-01],
##        [ 9.37147231e-01,  7.38120051e-01, -1.85287953e-01],
##        [-1.35567085e-01,  7.68091998e-01, -3.42483062e-01],
##        [-1.54670849e-03,  4.87430095e-01, -1.13341880e+00],
##        [ 1.81166762e+00,  8.55473063e-01,  3.57981017e-01],
##        [ 8.36178326e-01,  5.79903109e-01, -6.71487455e-01],
##        [ 3.84087244e-01,  4.52157032e-01, -1.14844592e+00],
##        [ 9.93641323e-01,  9.58113644e-01,  4.71550217e-01],
##        [-1.31284067e-01,  1.17719072e+00,  8.55832434e-01],
##        [ 1.18530750e+00,  7.04317972e-01, -2.27366452e-01],
##        [-1.50813925e+00,  6.99078458e-01, -8.58079441e-01],
##        [-9.33374729e-01,  4.42717154e-01, -1.47803940e+00],
##        [ 1.52968153e+00,  8.13389519e-01,  1.70331298e-01]])
archetypes_wagner = AA.alfa
pd.DataFrame(AA.archetypes).to_csv("/scratch/anissa.el/ImmuneStates/wagner_analysis/outputs/Coordinates_3PCs_Archetypes_Wagner_CLR_clust1.csv")
pd.DataFrame(archetypes_wagner).to_csv("/scratch/anissa.el/ImmuneStates/wagner_analysis/outputs/Archetypes_3PCA_Wagner_CLR_clust1.csv")
ArchetypesBCWagner.CLR <- as_tibble(read.csv("./outputs/Archetypes_3PCA_Wagner_CLR_clust1.csv",header=TRUE))
ArchetypesBCWagner.CLR <- ArchetypesBCWagner.CLR%>%column_to_rownames(var="X")
rownames(ArchetypesBCWagner.CLR) <- c("ARCH1", "ARCH2","ARCH3")
## Checking that the sum is equal to 1 : probability to belong to one of the archetypes
#colSums(archWPCA)
colnames(ArchetypesBCWagner.CLR) <- rownames(scBC.TME.PCA$li)

ArchsCoord.CLR <- as.matrix(read.csv("./outputs/Coordinates_3PCs_Archetypes_Wagner_CLR_clust1.csv", header=TRUE))

ArchsCoord.CLR <- ArchsCoord.CLR[,-1]
colnames(ArchsCoord.CLR) <- c("ARCH1", "ARCH2","ARCH3")

fig4 <-  plot_ly(x = scBCPCA.CLR$li$Axis1, 
                y = scBCPCA.CLR$li$Axis2,
                z = scBCPCA.CLR$li$Axis3,
                color =~as.factor(km.res$cluster),
                type = "scatter3d", mode = "markers",
                marker = list(symbol = "triangle",size = 8),
                mode = "text")
              

fig4 <- fig4 %>% layout(scene = list(xaxis = list(title = "PC1"), yaxis = list(title = "PC2"), zaxis = list(title = "PC3")),
                        title = "PCA on cell proportions space of BC tumors")

fig4 <- fig4%>%add_trace(x = ArchsCoord.CLR[1,],
                                 y = ArchsCoord.CLR[2,],
                                 z = ArchsCoord.CLR[3,],
                                 type = "scatter3d",mode = "lines+markers+text",
                                 text = c("ARCH2","ARCH1","ARCH3"),#c("arch. 2", "arch. 1","arch. 3","arch. 4"),
                                 textposition = c('top right','top right','top left'),
                                 textfont = list(color = '#000000', size = 16),
                                 showlegend = TRUE,
                                 name = "Archetypes",
                                 marker = list(color = c("darkmagenta","dodgerblue","black"),symbol = "star-diamond",size = 14),
                                 line = list(color = "black", size = 5),
                                 inherit = FALSE)%>%
                        add_trace(x = ArchsCoord.CLR[1,c(3,1,2)],
                                  y = ArchsCoord.CLR[2,c(3,1,2)],
                                  z = ArchsCoord.CLR[3,c(3,1,2)],
                                  type = "scatter3d",mode = "lines",
                                  line = list(color = "black", size = 5),
                                  showlegend = FALSE,
                                  inherit = FALSE)              

#fig12 <- fig12 %>% layout(scene = list(xaxis = list(title = "PC1"), yaxis = list(title = "PC2"), zaxis = list(title = "PC3") ),
#                        title = "PCA on cell proportions space of BC tumors")
fig4
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

Random shuffling

# de-correlate the cells using sample() as a row-wise operation (for each tumor)
# For each randomization, we perform an unscaled, centered PCA on the data
# We then collect the list of eigen values for each principal component
# A cumulative sum of variance explained is then computed (in %)
# The results are plotted in a figure with the mean of those shuffled +/- 2 sd
library(zCompositions)
library(compositions)
nbShuffles <- 100
nPC <- length(colnames(CellsPropBC2))-1
EigenVals.rand <- matrix(nrow = nbShuffles,ncol = nPC)
set.seed(15) # for reproducibility
for (i in 1:nbShuffles){
  CellsPropBC.rand <- apply(CellsPropBC2,MARGIN = 2,function(x) sample(x,size = length(x),replace = FALSE)) ## Column-wise operation ==> MARGIN=2
  sumsSamples <- apply(CellsPropBC.rand,MARGIN=1,function(x) sum(x))
  CellsPropBC.rand2 <- (CellsPropBC.rand/sumsSamples)*100
  CellsPropBC.rand.clr <- clr( CellsPropBC.rand2 +1)
  eigs <- dudi.pca(CellsPropBC.rand.clr,scale = TRUE,center = TRUE,nf = 3, scannf = FALSE)$eig
  #print(length(eigs))
  EigenVals.rand[i,]<- cumsum(eigs/sum(eigs)*100)
}

# Original data: PropsBC.CLR2

#### Get mean and +/- sd of the eigen values from shuffles (100 shuffles)
scBC.rand.CumEigs.mean <- apply(EigenVals.rand,MARGIN = 2,function(x) mean(x))
scBC.rand.CumEigs.sdUp <- apply(EigenVals.rand,MARGIN = 2,function(x) mean(x) + 2*sd(x))
scBC.rand.CumEigs.sdLow <- apply(EigenVals.rand,MARGIN = 2,function(x) mean(x) - 2*sd(x))
scBC.CumEigs <- cumsum(scBCPCA.CLR$eig/sum(scBCPCA.CLR$eig)*100) # Eigen vlaues of PCA from real tumors (CellsPropBC2 abundance matrix as input)

#Correlation map after randomization
heatmap(cor(CellsPropBC.rand.clr,method = "pearson"),Colv = NA,Rowv = NA,col = viridis(12))

nbComp <- seq(1,nPC,1)
CumEigs.rand <- as_tibble(cbind(scBC.rand.CumEigs.sdUp,scBC.rand.CumEigs.sdLow))%>%
  mutate(nbComp = nbComp)
CumEigs <- as_tibble(cbind(scBC.CumEigs,scBC.rand.CumEigs.mean))%>%
  mutate(nbComp = nbComp)

CumEigs<-CumEigs%>%
  dplyr::rename(`real \ntumors` = scBC.CumEigs)%>%
  dplyr::rename(`shuffled \ntumors` = scBC.rand.CumEigs.mean)%>%
  pivot_longer(cols = c(`real \ntumors`, `shuffled \ntumors`),
               names_to = "groups",
               values_to = "value")

ggplot()+
  geom_line(data = CumEigs,
            aes(x = nbComp,y = value,
                group = groups,color = groups,
                linetype = groups))+
  scale_linetype_manual(values = c("solid", "twodash"))+
  scale_color_manual(values = c("black","red"))+
  scale_size_manual(values = c(2, 2))+
  geom_vline(xintercept = 3, linetype = "dotted", 
                color = "black", size = 0.8)+
  labs(x="Number of Principal Components (PCs)",
       y="% of variance in cellular \ncomposition of tumors")+
  theme(legend.title = element_blank())+
  geom_ribbon(data = CumEigs.rand, aes(x = nbComp,
                                       ymin = scBC.rand.CumEigs.sdLow,
                                       ymax = scBC.rand.CumEigs.sdUp),
              fill = "indianred2",alpha = 0.3)

  #ggsave("./figs/EigsRandWagner.pdf", height = 3, width = 4)
library(zCompositions)
library(compositions)

CellsBC.ilr <- ilr(CellsPropBC2)

scBCPCA.ILR <- dudi.pca(CellsBC.ilr,scale=FALSE,center=TRUE,scannf=FALSE,nf=3)
#write.csv(as.matrix(scBCPCA.CLR$li), "/scratch/anissa.el/ImmuneStates/wagner_analysis/data/3PCA_Wagner_CLR.csv",row.names=FALSE)


fviz_pca_biplot(scBCPCA.ILR,label ="var",col.var="black",col.ind = "indianred2",repel=TRUE)

fviz_pca_biplot(scBCPCA.ILR,axes=c(1,3),col.var="black",label="var",col.ind = "indianred2",repel=TRUE)

fviz_pca_biplot(scBCPCA.ILR,axes=c(2,3),col.var="black",label="var",col.ind = "indianred2",repel=TRUE)

fviz_eig(scBCPCA.ILR)

fig3 <- plot_ly(x = scBCPCA.ILR$li$Axis1, 
                y = scBCPCA.ILR$li$Axis2,
                z = scBCPCA.ILR$li$Axis3,
                color =~CellsPropBC2$`Other`,
                type = "scatter3d", mode = "markers",
                marker = list(symbol = "triangle",size = 8),
                mode = "text")
              

fig3 <- fig3 %>% layout(scene = list(xaxis = list(title = "PC1"), yaxis = list(title = "PC2"), zaxis = list(title = "PC3") ),
                        title = "PCA on cell proportions space of BC tumors")
fig3